Abstract


Recent  developments  in  cognitive  psychology  suggest  models  for 
knowledge  and  learning  that  often  fall  outside  the  realm  of  standard  test 
theory.  This  paper  concerns  probability-based  inference  in  terms  of  such 
models.  An  approach  utilizing  Bayesian  inference  networks  is  outlined. 
Basic  ideas  of  structure  and  computation  in  inference  networks  are 
discussed,  and  illustrated  with  an  example  from  the  domain  of  mixed- 
number  subtraction. 

Key words: Bayesian inference, belief nets, cognitive diagnosis,
cognitive psychology, educational measurement, inference networks.


Introduction 


The psychological paradigm emerging from cognitive psychology suggests new
models for students’ capabilities, offering a potentially powerful framework to plan instruction,
evaluate progress, and provide feedback to students and teachers (Snow & Lohmann,
1989). As in traditional test theory, however, we face problems of inference: Just what
kinds of things are to be said about students, by themselves or others? What evidence is
needed to support such statements? How much faith can we place in the evidence, and in
the statements? How do we sort out elements of evidence that are overlapping, redundant,
or contradictory? When do we need to ask different questions or pose additional situations
to distinguish among competing explanations of what we see?

This  paper  discusses  a  probabilistic  framework  for  addressing  questions  like  these. 
The essential idea is to define a space of “student models”: simplified characterizations of
students’ knowledge, skill, and/or strategies, indexed by variables that signify their key
aspects.  From  theory  and  data,  one  posits  probabilities  for  the  ways  that  students  with 
different  configurations  in  this  space  will  solve  problems,  answer  questions,  and  so  on. 
This  done,  the  machinery  of  probability  theory  allows  one  to  reason  from  observations  of  a 
student’s  actions  to  likely  values  of  parameters  in  a  student  model. 

Recent  developments  in  statistical  theory  make  it  possible  to  carry  out  such 
inference  in  large  and  complex  systems  of  variables.  The  program  of  research  introduced 
here  is  beginning  to  explore  the  potential  of  this  approach  in  educational  assessment  and 
cognitive  diagnosis.  By  working  out  the  details  of  specific  illustrative  examples,  we  are 
learning about the kinds of domains and student models that are practical to address, and
starting to tackle an agenda of practical engineering challenges. We begin with an overview
of inference networks, walking through a simple numerical example from medical
diagnosis. An example from mixed-number subtraction illustrates the features of the
approach as applied to cognitive assessment.

Probability-Based  Inference 

Inference is reasoning from what we know and what we observe to explanations,
conclusions, or predictions. We are always reasoning in the presence of uncertainty. The
information we work with is typically incomplete, inconclusive, and amenable to more than one
explanation. We attempt to establish the weight and coverage of evidence in what we
observe. But the very first question we must address is “Evidence about what?” Schum
(1987, p. 16) points out the crucial distinction between data and evidence: “A datum
becomes evidence in some analytic problem when its relevance to one or more hypotheses
being considered is established. . . . [E]vidence is relevant on some hypothesis if it either
increases or decreases the likeliness of the hypothesis. Without hypotheses, the relevance
of no datum could be established.” In educational assessment and cognitive diagnosis, we
construct hypotheses around notions of the nature and acquisition of knowledge and skill.

Schum distinguishes three types of reasoning, the distinctions among which are
central to this presentation. Deductive reasoning flows from generals to particulars, within
an established framework of relationships among variables: from causes to effects, from
diseases to symptoms, from the way a crime is committed to the evidence likely to be found
at the scene, from a student’s knowledge and skills to observable behavior. That is, under
a given state of affairs, what are the likely outcomes? Inductive reasoning flows in the
opposite direction, also within an established framework of relationships: from effects to
possible causes, from symptoms to diseases, from observable behavior to probable
configurations of a student’s knowledge and skills. Given outcomes, what state of affairs
led to them? Abductive reasoning proceeds from observations to new
hypotheses, new variables, or new relationships among variables. “Such a ‘bottom-up’
process certainly appears similar to induction; but there is an argument that such reasoning
is, in fact, different from induction since an existing hypothesis collection is enlarged in the
process. Relevant evidentiary tests of this new hypothesis are then deductively inferred
from the new hypothesis” (Schum, 1987, p. 20).

The  diagnostic  approach  discussed  in  this  paper  consists  of  a  network  of  variables 
defining  the  student  model  space,  the  observable-outcome  space,  and  the  interrelationships 
among  them.  All  three  types  of  reasoning  play  a  role: 

• Abductive reasoning guides its construction, drawing upon research results and
previous practice to suggest the basic structure; statistical analyses then refine it. For
example, Piaget (e.g., 1960) searched painstakingly for commonalities in the
development of children’s proportional reasoning abilities over years of unique learning
episodes of individual children. Siegler’s (1981) characterization of children’s
understandings of balance-beam problems as a sequence of increasingly sophisticated
strategic flowcharts captures key aspects of some of these patterns, and provides a
basis for a student model space (Mislevy, Yamamoto, & Anacker, 1992).

•  Deductive  reasoning,  supplemented  by  parameter  estimation,  is  used  to  posit 
distributions  of  observable  variables  given  configurations  of  variables  in  the  student 
model.  In  Siegler’s  study,  this  corresponds  to  determining  how  a  child  with  a  given  set 
of  strategies  at  her  disposal  might  attack  a  given  balance-beam  problem,  in  terms  of 
distributions  of  expected  classes  of  actions. 

•  Inductive  reasoning,  embodied  in  the  algebra  of  probability  theory,  guides  reasoning 
from  observations  of  a  given  student  to  inferences  about  her  knowledge  and  skills,  in 
terms of updated beliefs about student-model variables. This corresponds to
characterizing our beliefs about which balance-beam strategies a child possesses after
seeing her responses to a set of problems.

• Abductive reasoning, triggered by unexpected patterns in data, is called for again by the
results of the inductive reasoning phase. Sometimes a particular child’s responses will
not  be  consistent  with  any  of  the  student  models  in  the  simplified  framework;  inductive 
reasoning  within  this  framework  fails  to  provide  a  satisfactory  working  approximation 
of  her  knowledge  and  skill.  In  such  cases,  we  need  richer  data  to  support  further 
exploration,  to  generate  new  conjectures. 

A key concept in probability-based inference is conditional independence. Defined
generally, the variables in one subset may be related in a population, yet be independent
given the values of another subset of variables. In cognitive models, relationships among
observable variables are “explained” by unobservable variables that characterize aspects
of knowledge, skill, strategies, and so on. In Thompson’s (1982) words, we ask “What
can this person be thinking so that his actions make sense from his perspective?” or “What
organization does the student have in mind so that his actions seem, to him, to form a
coherent pattern?” Judea Pearl argues that creating such intervening variables is not merely
a technical convenience, but a natural element in human reasoning:

“. . . conditional independence is not a grace of nature for which we must
wait passively, but rather a psychological necessity which we satisfy
actively by organizing our knowledge in a specific way. An important tool
in such organization is the identification of intermediate variables that induce
conditional independence among observables: if such variables are not in
our vocabulary, we create them. In medical diagnosis, for instance, when
some symptoms directly influence one another, the medical profession
invents a name for that interaction (e.g., ‘syndrome,’ ‘complication,’
‘pathological state’) and treats it as a new auxiliary variable that induces
conditional independence: dependency between any two interacting systems
is fully attributed to the dependencies of each on the auxiliary variable.”

Pearl, 1988, p. 44.

Conditional  independence  is  thus  a  conceptual  tool  to  structure  reasoning,  helping 
to  define  variables,  organize  relationships,  and  guide  deductive  reasoning.  In  educational 
and  psychological  measurement,  a  heritage  of  statistical  inference  built  around 
unobservable  variables  and  induced  conditional  probability  relationships  extends  back  to 
Spearman’s (e.g., 1907) early work with latent variables, to Wright’s (1934) path analysis,
and to Lazarsfeld’s (1950) latent class models. The resemblance of the inference networks
presented below to LISREL diagrams (Jöreskog & Sörbom, 1989) is no accident! Our
work  shares  inferential  machinery  with  this  tradition,  but  extends  the  universe  of  discourse 
to  student  models  suggested  by  cognitive  psychology. 

Inference  Networks 

Probability-based  inference  in  complex  networks  of  interdependent  variables  is  an 
active  topic  in  statistical  research,  spurred  by  applications  in  such  diverse  areas  as 
forecasting,  pedigree  analysis,  troubleshooting,  and  medical  diagnosis  (e.g.,  Lauritzen  & 
Spiegelhalter,  1988;  Pearl,  1988).  Current  interest  centers  on  obtaining  the  distributions  of 
selected  variables  conditional  on  observed  values  of  other  variables,  such  as  likely 
characteristics  of  offspring  of  selected  animals  given  characteristics  of  their  ancestors,  or 
probabilities  of  disease  states  given  symptoms  and  test  results.  As  we  shall  see  below, 
conditional  independence  relationships,  as  suggested  by  substantive  theory,  play  a  central 
role  in  the  topology  of  the  network  of  interrelationships  in  a  system  of  variables.  If  the 
topology  is  favorable,  such  calculations  can  be  carried  out  in  real  time  in  large  systems  by 
means  of  strictly  local  operations  on  small  subsets  of  interrelated  variables  (“cliques”)  and 
their intersections.

This section briefly reviews basic concepts of construction and local computation
for inference networks. Details can be found in the statistical and expert-systems literature;
Lauritzen and Spiegelhalter (1988), Pearl (1988), and Shafer and Shenoy (1988), for
example, discuss updating strategies, a kind of generalization of Bayes’ theorem. Computer
programs are commercially available to carry out the number-crunching aspect. We used
Andersen, Jensen, Olesen, and Jensen’s (1989) HUGIN program and Noetic Systems’
(1991) ERGO for the examples in this presentation.

To  move  from  a  structure  of  interrelationships  among  variables  to  a  representation 
amenable  to  real-time  local  calculation,  the  steps  listed  below  are  taken.  The  first  two 
encompass  defining  the  key  variables  in  an  application  and  explicating  their 
interrelationships.  In  essence,  this  information  is  the  input  to  programs  like  ERGO  and 
HUGIN,  which  then  carry  out  Steps  3  through  7. 

Step 1. Recursive representation of the joint distribution of variables.

Step  2.  Directed  graph  representation  of  (1). 

Step  3.  Undirected,  triangulated  graph. 

Step 4. Determination of cliques and clique intersections.

Step  5.  Join  tree  representation. 

Step  6.  Potential  tables. 

Step  7.  Updating  scheme. 

Although computer programs are available, it is nevertheless useful to walk
through the details of a simple example (that is, to watch what happens inside the “black box”) to
develop intuition that can guide more ambitious applications. We borrow a simple example
from Andreassen, Jensen, and Olesen (n.d.). It concerns two possible diseases a particular
patient  may  have,  flu  and  throat  infection  (FLU  and  THRINF),  and  two  possible 
symptoms,  fever  and  sore  throat  (FEV  and  SORETHR).  The  diseases  are  modeled  as 
independent,  and  the  symptoms  as  conditionally  independent  given  disease  states.  These 
relationships  are  depicted  in  Figure  1,  which  will  be  discussed  in  greater  detail  below.  All 
four  variables  can  take  values  of  “yes”  and  “no.”  We  assume  that  exactly  one  value 
characterizes  each  variable  for  a  patient,  although  we  may  not  know  these  values  with 
certainty.  We  employ  probabilities  to  express  our  states  of  belief.  We  note  in  passing  that 
it  would  be  possible  to  work  with  the  full  joint  distribution  of  the  four  variables  in  this 
example  directly,  using  the  textbook  form  of  Bayes  theorem  to  update  beliefs  of  disease 
states  as  symptoms  become  known.  This  approach  rapidly  becomes  infeasible  as  the 
number  of  variables  in  the  system  increases,  whereas  the  approach  described  below  has 
been employed in networks with over 1000 variables (Andreassen, Woldbye, Falck, &
Andersen,  1987). 


[[Figure  1  about  here]] 

1. A recursive representation of the joint distribution of variables

A recursive representation of the joint distribution of a set of random variables
$X_1, \ldots, X_n$ takes the form

$$
\begin{aligned}
p(X_1,\ldots,X_n) &= p(X_n \mid X_{n-1},\ldots,X_1)\, p(X_{n-1} \mid X_{n-2},\ldots,X_1) \cdots p(X_2 \mid X_1)\, p(X_1) \\
&= \prod_{j=1}^{n} p(X_j \mid X_{j-1},\ldots,X_1), \qquad (1)
\end{aligned}
$$

where the term for $j=1$ is defined as simply $p(X_1)$. A recursive representation can be
written for any ordering of the variables, but one that exploits conditional independence
relationships can prove more useful as variables drop out of the conditioning lists. This is
where substantive theory comes into play; for example, modeling conditional probabilities
of symptoms given disease states, rather than vice versa. The following representation
exploits the independence of FLU and THRINF and the conditional independence of FEV
and SORETHR:

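$$
\begin{aligned}
&P(\text{FEV}, \text{SORETHR}, \text{FLU}, \text{THRINF}) \\
&\quad= P(\text{FEV} \mid \text{SORETHR}, \text{FLU}, \text{THRINF})\, P(\text{SORETHR} \mid \text{FLU}, \text{THRINF})\, P(\text{FLU} \mid \text{THRINF})\, P(\text{THRINF}) \\
&\quad= P(\text{FEV} \mid \text{FLU}, \text{THRINF})\, P(\text{SORETHR} \mid \text{FLU}, \text{THRINF})\, P(\text{FLU})\, P(\text{THRINF}). \qquad (2)
\end{aligned}
$$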

Equation  2,  like  Figure  1,  indicates  the  qualitative  dependence  structure  of  the 
relationships  among  the  variables  without  specifying  quantitative  values.  Constructing  the 
full joint distribution from the recursive representation requires the specification of
a conditional probability matrix for each variable. For each combination of values of a
variable’s parents, this matrix gives the conditional probabilities of each of its potential
values. Associated with variables having no parents, such as FLU and THRINF, is a
vector of base rates or prior probabilities. We shall assign to both FLU and THRINF prior
probabilities of .11 for “yes” and .89 for “no.” This might correspond to base rates in a
reference population to which our patient belongs. Conditional probabilities of FEV and of
SORETHR given all combinations of FLU and THRINF appear in Table 1. In practice,
such probabilities would be determined by disease theory, physiological principles, and
past experience. The tabled values indicate that:

• Throat infection usually causes a sore throat whether or not flu is also present (.91
with flu and .90 without); flu alone occasionally leads to a sore throat (.05), but the
chance of a sore throat without either flu or throat infection is only .01.
• Having both flu and throat infection leads almost certainly to fever (.99); either
disease by itself leads to fever with probability .90; and the probability of fever
when neither disease is present is only .01.

[[Table  1  about  here]] 
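As a rough check on Equation 2, the sketch below rebuilds the full joint distribution from the priors and conditional probabilities quoted above. The Python names (`joint`, `bernoulli`, and the dictionaries) are illustrative choices, not part of the original example. It also shows that the two symptoms, although conditionally independent given the disease states, are clearly dependent marginally.

```python
from itertools import product

# Prior (base-rate) probabilities of "yes", as given in the text.
P_FLU = 0.11
P_THRINF = 0.11

# Conditional probabilities of "yes" given (flu, throat infection),
# using the values quoted in the text for Table 1.
P_FEVER = {(True, True): 0.99, (True, False): 0.90,
           (False, True): 0.90, (False, False): 0.01}
P_SORE = {(True, True): 0.91, (True, False): 0.05,
          (False, True): 0.90, (False, False): 0.01}


def bernoulli(p_yes, value):
    """Probability of `value` for a yes/no variable with P(yes) = p_yes."""
    return p_yes if value else 1.0 - p_yes


def joint(fever, sore, flu, thrinf):
    """Equation 2: P(FEV|FLU,THRINF) P(SORETHR|FLU,THRINF) P(FLU) P(THRINF)."""
    return (bernoulli(P_FEVER[flu, thrinf], fever)
            * bernoulli(P_SORE[flu, thrinf], sore)
            * bernoulli(P_FLU, flu)
            * bernoulli(P_THRINF, thrinf))


# All 16 configurations of (fever, sore throat, flu, throat infection).
states = list(product([True, False], repeat=4))
total = sum(joint(*s) for s in states)

# Marginal probabilities of each symptom and of both together.
p_fever = sum(joint(*s) for s in states if s[0])
p_sore = sum(joint(*s) for s in states if s[1])
p_both = sum(joint(*s) for s in states if s[0] and s[1])

print(f"sum of joint            = {total:.6f}")
print(f"P(fever) * P(sore)      = {p_fever * p_sore:.4f}")
print(f"P(fever and sore)       = {p_both:.4f}")
```

The product of the two marginals is much smaller than the probability of both symptoms together, which is the marginal dependence that the disease variables "explain."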

The  updating  schemes  discussed  below  assume  these  conditional  probabilities  are 
known  with  certainty.  In  practice,  of  course,  they  are  usually  not.  Current  research  in  the 
field includes characterizing the impact of this source of uncertainty, sequentially improving
estimates  as  additional  data  are  obtained,  and  incorporating  this  uncertainty  formally  by 
augmenting  the  network  with  variables  that  parameterize  the  extent  of  knowledge  about 
conditional  probabilities  (Spiegelhalter,  1989). 

2. A directed graph representation of the joint distribution of variables

Corresponding  to  the  algebraic  representation  of  Equation  1  is  a  graphical 
representation — a  directed  acyclic  graph  (DAG).  The  graph  inherits  its  “directedness”  and 
“acyclic”  properties  from  the  recursive  expression  of  the  distribution  in  Equation  1. 
Direction  comes  from  which  variables  are  written  as  conditional  on  others  in  the 
representation,  and  the  recursive  expression  prohibits  “cycles”  such  as  “A  depends  on  B,  B 
depends  on  C,  and  C  depends  on  A.”  Figure  1,  corresponding  to  Equation  2,  is  the  DAG 
for our example. Each variable is a node in the graph; directed arrows run from “parents”
to “children,” indicating conditional dependence relationships among the variables.
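One way to make the DAG concrete is to store each variable's parents and confirm that an ordering consistent with a recursive representation exists. The sketch below is only an illustration: it uses Python's standard `graphlib` module (Python 3.9 or later), and the helper name `is_acyclic` is an assumption of this example.

```python
from graphlib import TopologicalSorter, CycleError

# Each node maps to its set of parents, read off Equation 2.
parents = {
    "FLU": set(),
    "THRINF": set(),
    "FEV": {"FLU", "THRINF"},
    "SORETHR": {"FLU", "THRINF"},
}


def is_acyclic(parent_map):
    """True if the parent structure admits a topological ordering."""
    try:
        list(TopologicalSorter(parent_map).static_order())
        return True
    except CycleError:
        return False


# An ordering in which every variable appears after its parents.
order = list(TopologicalSorter(parents).static_order())
print(order)

# The prohibited pattern from the text: A depends on B, B on C, C on A.
cyclic = {"A": {"B"}, "B": {"C"}, "C": {"A"}}
print(is_acyclic(parents), is_acyclic(cyclic))
```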

A  DAG  depicts  the  qualitative  structure  of  associations  among  variables  in  the 
domain.  Theory  about  the  domain  is  the  starting  point,  but  a  real  application  requires 
model-fitting,  model  evaluation,  and  model  refinement.  While  many  standard  techniques 
from  statistical  theory  are  useful  in  this  endeavor,  certain  complications  arise.  In  large 
networks, for example, many cases will be incomplete; there is no practical need to obtain
the results of additional detailed diagnostic tests for diseases that have already been ruled
out. And while global tests of model fit are useful in comparing alternative models, more
focused tests checking local features of models and verifying predictions one case at a time
are more useful for model refinement. While the updating schemes discussed below take
the DAG structure as given, we must keep in mind that the success of an application
ultimately depends on the care and thought that go into developing that structure.

3.  An  undirected,  triangulated  graph 

Starting  with  the  DAG,  one  drops  the  directions  of  the  associations  and  adds  edges 
as  necessary  to  meet  two  requirements.  First,  the  parents  of  a  given  child  must  be 
connected.  Secondly,  the  graph  must  be  triangulated;  that  is,  any  path  of  connections  from 
a  variable  back  to  itself  (a  loop)  consisting  of  four  or  more  variables  must  have  a  chord,  or 
“short  cut.”  Triangulation  is  necessary  for  expressing  probability  relationships  in  a  way 
that  lends  itself  to  coherent  propagation  of  information.  Kim  and  Pearl’s  (1983)  initial 
work  with  individual  variables  showed  how  to  carry  out  coherent  local  updating  in  singly 
connected  networks  of  variables,  or  networks  of  variable  associations  with  no  loops  at  all. 
Most  networks  are  not  singly  connected,  however.  Even  our  simple  example  has  loops; 
for example, one can start a path at FEVER, follow a connection to FLU, then to
SORTHR, then to THRINF, and finally return to FEVER.

The  more  recent  updating  schemes  discussed  here  generalize  Kim  and  Pearl’s  ideas 
by  arranging  variables  into  subsets  called  cliques,  in  a  way  such  that  the  cliques  form  a 
singly-connected  graph.  Generalizations  of  Kim  and  Pearl’s  approach  can  then  be  applied 
at  the  level  of  cliques.  Triangulating  the  original  graph  of  variables  guarantees  that  a 
singly-connected clique representation can be constructed (Jensen, 1988). A triangulation
scheme is not necessarily unique, and various algorithms have been developed to construct
triangulated graphs that support efficient calculation (e.g., Tarjan & Yannakakis, 1984).
Figure 2 is the undirected, triangulated graph for our example.

[[Figure  2  about  here]] 
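The two requirements of this step can be sketched in a few lines. The code below connects the parents of each child, drops directions, and then tests triangulation by repeatedly removing a node whose remaining neighbors are all mutually connected. This elimination test is a standard check chosen for illustration; it is not necessarily the procedure used by HUGIN or ERGO, and the function names are assumptions of this sketch.

```python
from itertools import combinations

# Parents of each node in the example DAG (labels as in the figures).
parents = {
    "FLU": set(),
    "THRINF": set(),
    "FEVER": {"FLU", "THRINF"},
    "SORTHR": {"FLU", "THRINF"},
}


def moralize(parent_map):
    """Drop edge directions and connect every pair of parents of a child."""
    edges = set()
    for child, pars in parent_map.items():
        for p in pars:
            edges.add(frozenset((p, child)))
        for a, b in combinations(sorted(pars), 2):
            edges.add(frozenset((a, b)))
    return edges


def is_triangulated(nodes, edges):
    """Check for a perfect elimination ordering by removing simplicial nodes."""
    adj = {v: {u for u in nodes if frozenset((u, v)) in edges} for v in nodes}
    remaining = set(nodes)
    while remaining:
        simplicial = next(
            (v for v in remaining
             if all(frozenset((a, b)) in edges
                    for a, b in combinations(adj[v] & remaining, 2))),
            None)
        if simplicial is None:
            return False          # some loop of four or more nodes lacks a chord
        remaining.remove(simplicial)
    return True


edges = moralize(parents)
chord = frozenset(("FLU", "THRINF"))
print(sorted(tuple(sorted(e)) for e in edges))
print(is_triangulated(list(parents), edges))            # with the FLU-THRINF chord
print(is_triangulated(list(parents), edges - {chord}))  # bare four-node loop
```

In this small example, connecting the two parents already supplies the chord for the four-node loop described above, so no further edges are needed.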

4.  Determination  of  cliques  and  clique  intersections 

From  the  triangulated  graph,  one  determines  cliques,  subsets  of  variables  that  are 
all  linked  pairwise  to  one  another.  Cliques  overlap,  with  sets  of  overlapping  variables 
called clique intersections. Cliques and clique intersections constitute the structure for local
updating. Figure 3 shows the two cliques in our example, {FEVER, FLU, THRINF} and
{FLU, THRINF, SORTHR}. The clique intersection is {FLU, THRINF}.

[[Figure  3  about  here]] 
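For a graph this small, the cliques can be found by brute force: enumerate every subset of variables, keep those whose members are all linked pairwise, and discard any that is contained in a larger one. The sketch below assumes the five undirected edges of the triangulated graph described in the text; practical programs use far more efficient algorithms.

```python
from itertools import combinations

nodes = ["FEVER", "FLU", "THRINF", "SORTHR"]
# Undirected edges of the triangulated graph for the example.
edges = {frozenset(e) for e in [
    ("FEVER", "FLU"), ("FEVER", "THRINF"), ("FLU", "THRINF"),
    ("FLU", "SORTHR"), ("THRINF", "SORTHR"),
]}


def is_complete(subset):
    """True if every pair of variables in the subset is linked."""
    return all(frozenset(pair) in edges for pair in combinations(subset, 2))


def maximal_cliques(nodes):
    """Brute-force search; exponential in the number of nodes."""
    complete = [set(c)
                for r in range(1, len(nodes) + 1)
                for c in combinations(nodes, r)
                if is_complete(c)]
    # Keep only the complete subsets not strictly contained in another one.
    return [c for c in complete if not any(c < other for other in complete)]


cliques = maximal_cliques(nodes)
intersection = cliques[0] & cliques[1]
print(cliques)
print(intersection)
```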

Just  as  there  can  be  multiple  ways  to  produce  a  triangulated  graph  from  a  given 
DAG,  there  can  be  multiple  ways  to  define  cliques  from  a  triangulated  graph.  Algorithms 
for  determining  a  clique  structure  that  supports  efficient  calculation  are  also  a  focus  of 
research.  The  amount  of  computation  grows  roughly  geometrically  with  clique  size,  as 
measured  by  the  number  of  possible  configurations  of  all  values  of  all  variables  in  a  clique. 
A  clique  representation  with  many  small  cliques  is  therefore  preferred  to  a  representation 
with  a  few  larger  cliques.  Strategies  for  increased  efficiency  at  this  stage  include  redefining 
variables,  adding  variables  to  break  loops,  and  dropping  associations  when  the 
consequences  are  benign. 
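The effect of clique size on table size can be illustrated with simple arithmetic. In the sketch below, the 20-variable chain is a hypothetical example introduced only for comparison and does not come from the paper; `table_size` is likewise an illustrative helper.

```python
from math import prod


def table_size(states_per_variable):
    """Number of cells in a table over variables with the given state counts."""
    return prod(states_per_variable)


# The medical example: four yes/no variables, two cliques of three variables.
full_joint = table_size([2, 2, 2, 2])
two_cliques = table_size([2, 2, 2]) + table_size([2, 2, 2])

# Hypothetical chain X1 - X2 - ... - X20 of yes/no variables (an assumption,
# not from the paper): its cliques are the 19 adjacent pairs.
n = 20
chain_joint = table_size([2] * n)
chain_cliques = (n - 1) * table_size([2, 2])

print(full_joint, two_cliques)      # 16 16
print(chain_joint, chain_cliques)   # 1048576 76
```

The four-variable example gains nothing from the clique representation, but in the hypothetical chain the clique tables are smaller than the full joint table by several orders of magnitude.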

5. Join tree representation

A join-tree representation depicts the singly-connected structure of cliques and
clique intersections. This is the structure through which local updating flows. A join tree
exhibits the running intersection property: If a variable appears in two cliques, it appears in
all cliques and clique intersections in the single path connecting them. Figure 4 gives the
join-tree for our example.

[[Figure  4  about  here]] 
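The running intersection property can be checked mechanically. The sketch below assumes the join tree is given as a list of cliques plus a list of connected index pairs forming a tree; the three-clique counterexample is hypothetical and included only to show a violation.

```python
from collections import deque


def tree_path(edges, start, goal):
    """Return the path of clique indices from start to goal (tree assumed connected)."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break
        for nxt in neighbors.get(node, []):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    path, node = [], goal
    while node is not None:
        path.append(node)
        node = previous[node]
    return path[::-1]


def has_running_intersection(cliques, edges):
    """Variables shared by two cliques must appear in every clique between them."""
    for i in range(len(cliques)):
        for j in range(i + 1, len(cliques)):
            shared = cliques[i] & cliques[j]
            if any(not shared <= cliques[k] for k in tree_path(edges, i, j)):
                return False
    return True


# Join tree for the example: two cliques joined through {FLU, THRINF}.
example = [{"FEVER", "FLU", "THRINF"}, {"FLU", "THRINF", "SORTHR"}]
print(has_running_intersection(example, [(0, 1)]))

# Hypothetical chain in which "A" is shared by the end cliques but
# missing from the middle one, violating the property.
bad = [{"A", "B"}, {"C", "D"}, {"A", "D"}]
print(has_running_intersection(bad, [(0, 1), (1, 2)]))
```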

6. Potential tables

Local calculation is carried out with tables that convey the joint distributions of
variables within cliques, or potential tables. Similar tables for clique intersections are used
to pass updating information from one clique to another. The potential tables in Table 2
indicate the initial status of the network for our example; that is, before a particular
individual’s symptoms or disease states become known. For example, the
potential table for Clique 1 is calculated using the prior probabilities of .11 for both flu and
throat infection, the assumption that they are independent, and the conditional probabilities
of fever for each flu/throat-infection combination.

[[Table  2  about  here]] 

The  initial  probability  for  fever  can  be  obtained  by  marginalizing  the  potential  table 
for  Clique  1  with  respect  to  flu  and  throat  infection.  This  amounts  to  summing  down  the 
“FEVER:  yes”  column,  yielding  a  value  of  .20.  Similarly,  the  initial  probability  for  sore 
throat  is  obtained  by  summing  down  the  “SORTHR:  yes”  column  in  the  potential  table  for 
Clique  2,  yielding  .11. 
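The initial potential tables and the two marginal probabilities can be reproduced directly from the priors and conditional probabilities quoted earlier. The sketch below is an illustration under that assumption; the table layout (rows indexed by flu and throat infection, columns "yes" and "no" for the symptom) is a choice made here and need not match Table 2 exactly.

```python
from itertools import product

P_FLU = P_THRINF = 0.11
# P(symptom = yes | flu, throat infection), values quoted in the text.
P_FEVER = {(True, True): 0.99, (True, False): 0.90,
           (False, True): 0.90, (False, False): 0.01}
P_SORTHR = {(True, True): 0.91, (True, False): 0.05,
            (False, True): 0.90, (False, False): 0.01}


def prior(flu, thrinf):
    """P(FLU, THRINF) under the independence assumption."""
    return ((P_FLU if flu else 1 - P_FLU)
            * (P_THRINF if thrinf else 1 - P_THRINF))


clique1 = {}   # rows (flu, thrinf); columns FEVER yes/no
clique2 = {}   # rows (flu, thrinf); columns SORTHR yes/no
for flu, thrinf in product([True, False], repeat=2):
    w = prior(flu, thrinf)
    clique1[flu, thrinf] = {"yes": w * P_FEVER[flu, thrinf],
                            "no": w * (1 - P_FEVER[flu, thrinf])}
    clique2[flu, thrinf] = {"yes": w * P_SORTHR[flu, thrinf],
                            "no": w * (1 - P_SORTHR[flu, thrinf])}

# Marginalize by summing down the "yes" column of each table.
p_fever = sum(row["yes"] for row in clique1.values())
p_sorthr = sum(row["yes"] for row in clique2.values())
print(f"P(FEVER=yes)  = {p_fever:.2f}")    # 0.20
print(f"P(SORTHR=yes) = {p_sorthr:.2f}")   # 0.11
```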

7.  Local  updating 

Absorbing new evidence about a single variable is effected by re-adjusting the
appropriate margin in a potential table that contains that variable, then propagating the
resulting change in that clique to other cliques via the clique intersections. This process
continues outward from the clique where the process began, until all cliques have been
updated. The single-connectedness and running intersection properties of the join tree
assure that coherent probabilities result.

Suppose  that  we  learn  the  patient  in  our  example  does  have  a  fever.  How  does  this 
change our beliefs about the other variables? The calculations are summarized in Table 3.

[[Table  3  about  here]] 

• The process begins with the potential table for Clique 1. In the initial condition, we
had a joint probability distribution for the variables in this clique, say, $f_0$(FEVER,
FLU, THRINF). We now know with certainty that FEVER=yes, so the column
for FEVER=no is zeroed out. Denote the updated potential table $f_1$(FEVER, FLU,
THRINF). One could re-normalize the entries in the FEVER=yes column at this
point, but only the proportionality information needs to be sent on for updating.

• The clique intersection table is updated to reflect the new proportional relationships
among the probabilities for FLU and THRINF, or $f_1$(FLU, THRINF).
Normalizing them to sum to one would give probabilities, $P_1$(FLU, THRINF),
which marginalize to .51 for FLU=yes and for THRINF=yes.

• The potential table for Clique 2 is updated by first dividing all entries in a row by
the value for that row in the original clique intersection table, then multiplying them
by the corresponding entries in the new one obtained in the previous step. The
resulting entries are proportional to the new posterior probabilities for the variables
in Clique 2. We now examine the rationale for this step in terms of probabilities
(but recall that it suffices within the black box to simply pass the correct information
about proportionalities along the join tree).

The initial joint probability distribution for Clique 2, $P_0$(FLU, THRINF, SORTHR),
implied beliefs about flu and throat infection that were consistent with those in the
initial status of Clique 1. But incoming information about fever modified belief
about flu and throat infection, to $P_1$(FLU, THRINF). We want to revise the
information in the potential table for Clique 2 so that it is (1) consistent with the
new beliefs about flu and throat infection, but (2) unchanged in terms of the
relationship of sore throat conditional on flu and throat infection. This is
accomplished as shown below, justifying the divide-by-old-and-multiply-by-new
algorithm:

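$$
\begin{aligned}
P_1(\text{FLU},\text{THRINF},\text{SORTHR})
&= P(\text{SORTHR} \mid \text{FLU},\text{THRINF})\, P_1(\text{FLU},\text{THRINF}) \\
&= \frac{P(\text{SORTHR} \mid \text{FLU},\text{THRINF})\, P_0(\text{FLU},\text{THRINF})}{P_0(\text{FLU},\text{THRINF})}\, P_1(\text{FLU},\text{THRINF}) \\
&= \frac{P_0(\text{FLU},\text{THRINF},\text{SORTHR})}{P_0(\text{FLU},\text{THRINF})}\, P_1(\text{FLU},\text{THRINF}).
\end{aligned}
$$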


•  The  entries  in  the  Clique  2  potential  table  can  be  re-normed  to  sum  to  one,  as  shown 
in the final panel in Table 3, to facilitate the calculation of individual combinations
of  values  or  of  margins.  For  example,  the  revised  probability  for  sore  throat  is  .48. 
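The updating steps above can be traced in code. The following is a minimal sketch, assuming the priors and conditional probabilities quoted earlier in the text; the dictionary layout and variable names are choices made for this illustration rather than the internal representation of HUGIN or ERGO.

```python
from itertools import product

P_FLU = P_THRINF = 0.11
# P(symptom = yes | flu, throat infection), values quoted in the text.
P_FEVER = {(True, True): 0.99, (True, False): 0.90,
           (False, True): 0.90, (False, False): 0.01}
P_SORTHR = {(True, True): 0.91, (True, False): 0.05,
            (False, True): 0.90, (False, False): 0.01}


def prior(flu, thrinf):
    """P_0(FLU, THRINF) under the independence assumption."""
    return ((P_FLU if flu else 1 - P_FLU)
            * (P_THRINF if thrinf else 1 - P_THRINF))


def cond(p_yes, value):
    return p_yes if value else 1.0 - p_yes


rows = list(product([True, False], repeat=2))   # (flu, thrinf) combinations
# Initial potential tables, keyed by ((flu, thrinf), symptom value).
clique1 = {(r, s): prior(*r) * cond(P_FEVER[r], s)
           for r in rows for s in (True, False)}
clique2 = {(r, s): prior(*r) * cond(P_SORTHR[r], s)
           for r in rows for s in (True, False)}
old_sep = {r: prior(*r) for r in rows}          # original intersection table

# Step 1: absorb FEVER=yes by zeroing the FEVER=no column of Clique 1.
for r in rows:
    clique1[r, False] = 0.0

# Step 2: new intersection table, proportional to P_1(FLU, THRINF).
new_sep = {r: clique1[r, True] + clique1[r, False] for r in rows}

# Step 3: update Clique 2 by dividing by the old row value and
# multiplying by the new one.
for r in rows:
    for s in (True, False):
        clique2[r, s] *= new_sep[r] / old_sep[r]

# Step 4: re-norm Clique 2 to sum to one and read off the margins.
total = sum(clique2.values())
posterior = {k: v / total for k, v in clique2.items()}
p_flu = sum(v for (r, s), v in posterior.items() if r[0])
p_thrinf = sum(v for (r, s), v in posterior.items() if r[1])
p_sorthr = sum(v for (r, s), v in posterior.items() if s)

print(f"P(FLU=yes | FEVER=yes)    = {p_flu:.2f}")     # 0.51
print(f"P(THRINF=yes | FEVER=yes) = {p_thrinf:.2f}")  # 0.51
print(f"P(SORTHR=yes | FEVER=yes) = {p_sorthr:.2f}")  # 0.48
```

The normalizing constant `total` equals the initial probability of fever, so the same pass also recovers the probability of the evidence itself.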

Application  to  Cognitive  Diagnosis 

The approach we are exploring begins in a specific application by defining a
universe of student models. This “supermodel” is indexed by parameters that signify
distinctions between states of understanding. Symbolically, we shall refer to the (typically
vector-valued) parameter of the student-model as η. A particular set of values of η
specifies a particular student model, or one particular state among the universe of possible
states the supermodel can accommodate. These parameters can be qualitative or
quantitative, and qualitative parameters can be unordered, partially ordered, or completely
ordered. A supermodel can contain any mixture of these types. Their nature is derived
from the structure and the psychology of the learning area, with the goal of being able to
express essential distinctions among states of knowledge and skill.

Any  application  faces  a  modeling  problem,  a  task  construction  problem,  and  an 
inference  problem. 

The modeling problem is delineating the states or levels of understanding in a
learning domain. In meaningful applications this might address several distinct strands of
learning, as understanding develops in a number of key concepts, and it might address the
connectivity among those concepts. This substep defines the structure of p(x | η), where x
represents observations. An interesting special case occurs when the universe of student
models can be expressed as performance models (Clancey, 1986). A performance model
consists of a knowledge base and manipulation rules that can be run on problems in a
domain of interest. A particular model can contain both knowledge and production rules
that are incorrect or incomplete; the solutions it produces will be correct or incorrect in
identifiable ways. Here the parameter η specifies features of performance models, such as
the set of production rules that characterizes a student’s state of competence.

Obviously any model will be a gross simplification of the reality of cognition. A
first consideration in what to include in the supermodel is the substance and the psychology
of the domain: Just what are the key concepts? What are important ways of understanding
and misunderstanding them? What are typical paths to competence? A second
consideration is the so-called grain-size problem, or the level of detail at which student
models should differ. A major factor in answering this question is the decision-making
framework under which the modeling will take place. As Greeno (1976) points out, “It
may not be critical to distinguish between models differing in processing details if the
details lack important implications for quality of student performance in instructional
situations, or the ability of students to progress to further stages of knowledge and
understanding.”

An analog for the student model space is Smith & Wesson’s “Identikit,” which
helps police construct likenesses of suspects. Faces differ in infinitely many ways, and
skilled police artists can sketch infinitely many drawings to match witnesses’ recollections
(which is not to say that police artists’ drawings duplicate suspects’ faces perfectly;
uncertainty enters in the link through the witness). Departments that can’t support an artist
use an Identikit, a collection of various face shapes, noses, ears, hair styles, and so on, that
can be combined to approximate witnesses’ recollections from a large, but finite, range of
possibilities. The payoff lies not in how accurately the Identikit composite depicts the
suspect, but in whether it aids the search enough to justify its use.

Research relevant to constructing student models has been carried out in a wide
variety of fields, including cognitive psychology, the psychology of mathematics learning
and science learning, and artificial intelligence (AI) work on student modeling. Cognitive
scientists have suggested general structures such as “frames” or “schemas” that can serve
as a basis for modeling understanding (e.g., Minsky, 1975; Rumelhart, 1980), and have
begun to devise tasks that probe their features (e.g., Marshall, 1989, 1993). Researchers
interested in the psychology of learning in subject areas such as proportional reasoning
have focused on identifying key concepts, studying how they are typically acquired (e.g.,
in mechanics, Clement, 1982; in ratio and proportional reasoning, Karplus, Pulos, &
Stage, 1983), and constructing observational settings that allow one to infer students’
understanding (e.g., van den Heuvel, 1990; McDermott, 1984). Our approach can succeed
only by building upon foundations of such research.

The task construction problem is devising situations for which students who differ
in the parameter space are likely to behave in observably different ways. The conditional
probabilities of behavior of different types given the unobservable state of the student are
the values of p(x|η), which may in turn be modeled in terms of another set of parameters
that have to be estimated. The p(x|η) values provide the basis for inferring back
about the student state. An element in x could contain a right or wrong answer to a
multiple-choice test item; it could instead be the problem-solving approach regardless of
whether the answer is right or wrong, the quickness of responding, a characteristic of a
think-aloud protocol, or an expert’s evaluation of a particular aspect of the performance.
The effectiveness of a task is reflected in differences in conditional probabilities associated
with different parameter configurations, so a task may be very useful in distinguishing
among some aspects of student models but useless for distinguishing among others
(Marshall, 1989).

The inference problem is reasoning from observations to student models. This is
where the inference network and local computation come into play. The model-building
and item construction steps define the relevant variables (the student-model variables η and
the observable variables x) and provide conditional probabilities. Let p(η) represent
expectations about η in a population of interest, possibly non-informative, possibly based
on expert opinion or previous analyses. Together, p(η) and p(x|η) imply our initial
expectations for what we might observe from a student. Once we make actual
observations, we can revise our probabilities through the network to draw inferences about
η given x, via p(η|x) ∝ p(x|η) p(η). Thus p(η|x) characterizes belief about a particular
student’s model after having observed a sample of the student’s behavior.
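
Written out with its normalizing constant, the proportionality above takes the following form. The sketch assumes a discrete student-model space, so the denominator is a sum (it would be an integral for continuous η). The second line adds the further assumption, not stated in this passage, that the observations x_1, ..., x_n are conditionally independent given η, which is what allows the network to be updated one observation at a time.

```latex
p(\eta \mid x) \;=\; \frac{p(x \mid \eta)\, p(\eta)}{\sum_{\eta'} p(x \mid \eta')\, p(\eta')},
\qquad
p(x \mid \eta) \;=\; \prod_{j=1}^{n} p(x_j \mid \eta).
```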

Example: Mixed-Number Subtraction

This example illustrates a model that is aimed at the level of short-term instructional
guidance. The form of the evidence being collected is traditional (right or wrong
responses to open-ended mixed-number subtraction problems), but inferences are carried
out in a student model motivated by cognitive analyses of the domain. It concerns which of
two strategies students apply to the problems, and whether they are able to carry out
procedures required singly or in combination in problems. Although a much finer grain
size can be entertained for models of these types of skills (e.g., VanLehn’s 1990 analysis
of whole number subtraction), this example incorporates the fact that whether an item is
easy or hard for a given student depends in part on the strategy she employs. Rather than
being discarded as noise, as it would be under standard test theory, this interaction is
exploited by the analytic model as a source of evidence about a student’s strategy usage.

The data and the cognitive analysis upon which the student model is grounded are
due to Kikumi Tatsuoka (1987, 1990). The middle-school students she studied
characteristically solve mixed number subtraction problems using one of two strategies:

Method A: Convert all whole and mixed numbers to improper fractions, subtract, then
reduce if necessary.

Method B: Separate mixed numbers into whole number and fractional parts, subtract as
two subproblems, borrowing one from the minuend whole number if
necessary, then reduce if necessary.

We analyzed 530 students’ responses to 15 items. Table 4 shows how we
characterized each item in terms of which of seven subprocedures would be required if it
were solved with Method A and which would be required if it were solved with Method B.
The student model comprises a variable for which strategy a student uses and variables for
which of the subprocedures he is able to apply. The structure connecting the unobservable
parameters of the student model and the observable responses is that ideally, a student
using Method X (A or B, as appropriate to that student) would correctly answer items that
under that strategy require only subprocedures the student has at his disposal (see
Falmagne, 1989, Tatsuoka, 1990, and Haertel & Wiley, 1993, on models of this type).
However, sometimes students miss items even under these conditions (false negatives),
and sometimes they correctly answer items when they don’t possess the subprocedures by
other, possibly incorrect, means (false positives). The connection between observations
and student model variables is thus probabilistic rather than deterministic.
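
The probabilistic link just described can be sketched as a small response function: a student who has every subprocedure an item requires answers correctly with the true positive rate, and any other student answers correctly with the false positive rate. The function and variable names are hypothetical. The rates and the skill requirement are those the paper reports for Item 12 under Method B in Table 5; the two student profiles are invented for illustration.

```python
def p_correct(student_skills, required_skills, true_positive, false_positive):
    """Probability of a correct response under the noisy conjunctive rule.

    The student succeeds at the true positive rate only if every required
    skill is present; otherwise success occurs at the false positive rate.
    """
    has_all = set(required_skills) <= set(student_skills)
    return true_positive if has_all else false_positive


# Item 12 under Method B requires Skills 1 and 2; rates are from Table 5.
ITEM12_REQUIRED = {1, 2}
TRUE_POSITIVE, FALSE_POSITIVE = 0.895, 0.452

# Two illustrative (invented) student profiles.
p_with_skills = p_correct({1, 2, 3}, ITEM12_REQUIRED, TRUE_POSITIVE, FALSE_POSITIVE)
p_without_skill2 = p_correct({1, 3}, ITEM12_REQUIRED, TRUE_POSITIVE, FALSE_POSITIVE)
print(p_with_skills, p_without_skill2)
```

The false negative rate is one minus the true positive rate, so only two numbers per item are needed in this form.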

[[Table  4  about  here]] 

A  network  for  Method  B 

Figure 5 is a graphic depiction of the structural relationships in an inference
network for Method B only. Nodes represent variables, and arrows represent dependence
relationships. The joint probability distribution of all variables can be represented as the
product of conditional probabilities, with each variable expressed in terms of conditional
probabilities given its “parents.” Five nodes, “Skill1” through “Skill5,” represent basic
subprocedures that a student who uses Method B might need to use to solve items.
Additional nodes, such as “Skills1&2,” are conjunctions, representing, for example, either
having or not having both Skill1 and Skill2. The node MN stands for “mixed number
skills.” It subsumes both Skill3, separating whole numbers from fractions, and Skill4,
borrowing a unit from a whole number. The MN node contains the logical relationship that
Skill3 is a prerequisite for Skill4. All of these skill variables and their combinations are
represented in Figure 5 by rectangles. They are the elements of the student model, or η.
The relationships among the skill nodes are either empirical (probabilities of having, say,
Skill2 given that one does or does not have Skill1) or logical (one has “Skills1&2” only if
one has both Skill1 and Skill2).


[[Figure  5  about  here]] 

The observables, x, are the actual test items. The oval nodes representing items
are children of nodes that represent the minimal conjunction of skills necessary to
solve that item if one uses Method B. The relationship between such a node and an item is
probabilistic, indicating false positive and false negative probabilities.

Cognitive theory inspired the structure of this network. Initial estimates of the
numerical values of conditional probability relationships were approximated using results
from Tatsuoka’s (1983) “rule space” analysis of the data, with only students she classified as
Method B users. That is, Dr. Tatsuoka’s estimates of whether a student did or did not
possess Skill1 and Skill2 were taken as truth, and our probabilities of students having
Skill1, of having Skill2 given that they did or didn’t have Skill1, and so on, are empirical
proportions from this data set. (Duanli Yan and I are exploring the estimation of these
conditional probabilities using the EM algorithm of Dempster, Laird, & Rubin, 1977.)
Table 5 gives three examples of the conditional probability matrices we used as input to
HUGIN and ERGO:

• Skill2 given Skill1. These are the conditional probabilities of having or not having
Skill2, given that a student does or does not have Skill1. These were approximated
from the results of Dr. Tatsuoka’s analysis, as described above.

• Skills1&2 given Skill1 and Skill2. This is a logical relationship, indicating that a
student has the conjunction of Skills 1 and 2 if and only if she has both Skill1 and
Skill2.

• Item12 given Skills1&2. This matrix gives the probabilities of correctly answering
Item 12, given that a student does or does not have the requisite set of skills under
Method B. For the row in which Skills1&2 is true, we have the true positive and
false negative rates, .895 and .105 respectively. For the row in which
Skills1&2 is false, we have the false positive and true negative rates, .452 and
.548. (A relatively high false positive rate such as this often occurs when an item on
a test has appeared as a textbook example or homework exercise.)
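
To show how the three matrices combine, the sketch below chains them by hand for this small fragment of the network. It is an illustration only, not the HUGIN or ERGO computation: it considers Skill1, Skill2, their conjunction, and Item 12 in isolation, and it takes the base rate for Skill1 (.883) from the prior column of Table 6. All names are hypothetical.

```python
# Conditional probability tables from Table 5; P(Skill1) is the prior in Table 6.
P_SKILL1 = 0.883
P_SKILL2_GIVEN_SKILL1 = {True: 0.662, False: 0.289}
P_ITEM12_CORRECT_GIVEN_BOTH = {True: 0.895, False: 0.452}


def joint_skills():
    """Joint distribution of (Skill1, Skill2) as a product of conditionals."""
    joint = {}
    for s1 in (True, False):
        p1 = P_SKILL1 if s1 else 1.0 - P_SKILL1
        for s2 in (True, False):
            p2_yes = P_SKILL2_GIVEN_SKILL1[s1]
            joint[(s1, s2)] = p1 * (p2_yes if s2 else 1.0 - p2_yes)
    return joint


joint = joint_skills()

# Marginal for Skill2, and for the logical conjunction node Skills1&2.
p_skill2 = sum(p for (s1, s2), p in joint.items() if s2)
p_both = joint[(True, True)]

# Expected proportion correct on Item 12 before any responses are observed.
p_item12 = sum(
    p * P_ITEM12_CORRECT_GIVEN_BOTH[s1 and s2] for (s1, s2), p in joint.items()
)

# Belief in Skills1&2 after observing a correct response to Item 12 alone.
posterior_both = p_both * P_ITEM12_CORRECT_GIVEN_BOTH[True] / p_item12

print(round(p_skill2, 3), round(p_both, 3))
```

The two printed marginals reproduce the prior probabilities listed for Skill 2 and for Skills 1 & 2 in Table 6 (.618 and .585). The single-item posterior computed here is not comparable to the posterior column of Table 6, which reflects a whole response vector.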

[[Table  5  about  here]] 

Figure 6 presents the join tree for the DAG depicted in Figure 5. Figure 7 depicts
base rate probabilities of skill possession and item percents-correct in the network with
empirical associations, using the conditional probabilities from Tatsuoka’s Rule Space
analysis. This represents the state of knowledge one has about a student knowing that she
uses Method B, but without having observed any item responses. Figure 8 shows how
beliefs are changed after observing mostly correct answers to items requiring
subprocedures other than Skill 2, but missing most of those that do require it. The base-rate
and the updated probabilities for the five skills shown in Table 6 show substantial
shifts toward the belief that the student commands Skills 1, 3, 4, and possibly 5, but
almost certainly not Skill 2.

[[Figures  6-8  about  here]] 

[[Table  6  about  here]] 

A  simplified  network  for  Method  B 

An alternative representation exemplifies the tradeoffs one faces when building
more complex networks, and illustrates their relationships to the network building and
manipulation steps discussed above. A simpler network results if empirical relationships
among skills are deleted, as shown in Figure 9. The resulting join tree is shown in Figure
10. The advantage of this simpler network is a join tree with a smaller largest
clique, containing 4 variables rather than 5. The largest potential table has only 16 entries,
rather than 48. By such simplifications, larger networks of variables can be updated in the
same amount of calculating time. The simpler network uses only direct information from
item responses to update beliefs about skill possession; that is, belief for Skill 3 is changed
only by responses to items that require Skill 3. The tradeoff is the forfeiture of indirect
information. Suppose we have ascertained that students who possess Skills 1 and 2
usually also possess Skill 3. The full network, incorporating this link, would revise our
belief about Skill 3 in response to indirect evidence in the form of correct answers to items
requiring Skills 1 and 2. The simplified network, omitting the link, would not revise belief
about Skill 3 without direct evidence, that is, responses to items requiring Skill 3 itself.
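
The table sizes quoted above follow from multiplying the numbers of states of the variables in a clique. The sketch below rests on an assumption that the text does not state: that every variable is binary except for one three-state variable in the larger clique (plausibly MN, with states for neither skill, Skill3 only, and both Skill3 and Skill4). That reading reproduces the counts of 16 and 48. The function name is hypothetical.

```python
from math import prod


def potential_table_size(cardinalities):
    """Number of entries in a clique's potential table: one per joint state."""
    return prod(cardinalities)


# Simplified network: largest clique has four variables, assumed all binary.
simplified = potential_table_size([2, 2, 2, 2])

# Full network: five variables, assuming one three-state variable and four binary ones.
full = potential_table_size([3, 2, 2, 2, 2])

print(simplified, full)  # prints 16 48
```

Since table size grows multiplicatively with each added variable, trimming even one variable from the largest clique reduces the work of every update.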

[[Figures  9  &  10  about  here]] 

What kinds of inferential errors result from this simplification? Closed-form results
with simple models indicate that ignoring positive relationships among unobservable
variables higher in the network can lead to weaker, or more conservative, revision of belief
about them from observations. This may be an acceptable price in some cases in return for
being able to incorporate more variables into a network. (On the other hand, ignoring
dependencies among observable variables can lead to overly strong updating, generally a
more costly error.) A second rationale for omitting the empirical relationships among
skills is that the resulting model, while conservative for a given population, may be more
transportable to other populations, for example, students who studied fractions under a
different curriculum. While the skill requirements of items may be fairly consistent over
students, the relationships among skills may depend more heavily on the order and
intensity with which they are studied.

A  simultaneous  network  for  both  methods 

We built a similar network for Method A. Figure 11 incorporates it and the Method
B network into a single network that is appropriate when we don’t know which strategy a
student is using. Each item now has three parents: minimally sufficient sets of
subprocedures under Method A and under Method B, and the new node “Is the student
using Method A or Method B?” An item like 7| - 5y is hard under Method A but easy
under Method B. An item like 2^-1^ is just the opposite. A response vector with most of
the first type of items right and the second type wrong shifts belief toward the use of
Method B, while the opposite pattern shifts belief toward the use of Method A. A pattern
with mostly wrong answers gives posterior probabilities for Method A and Method B that
are about the same as the base rate, but low probabilities for possessing any of the skills.
We haven’t learned much about which strategy such a student is using, but we do have
evidence that he probably doesn’t have subprocedure skills. Similarly, a pattern with
mostly right answers again gives posterior probabilities for Method A and Method B that
are about the same as the base rate, but high probabilities for possessing all of the skills. In
any of these cases, the results could be used to guide an instructional decision.
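
The role of the method node as a third parent can be sketched as follows. The rates, the prior of .5, and the skill states are hypothetical values chosen for illustration, not estimates from the paper. To keep the example short, the skill-set states are treated as known, whereas the full network is also uncertain about them.

```python
def p_correct(method, has_skills_a, has_skills_b, true_positive, false_positive):
    """P(correct) when only the skill set for the student's own method matters."""
    relevant = has_skills_a if method == "A" else has_skills_b
    return true_positive if relevant else false_positive


def posterior_method_a(prior_a, items, responses):
    """P(Method A | responses), treating the skill-set states as known."""
    weight = {"A": prior_a, "B": 1.0 - prior_a}
    for (has_a, has_b, tp, fp), correct in zip(items, responses):
        for method in weight:
            p = p_correct(method, has_a, has_b, tp, fp)
            weight[method] *= p if correct else 1.0 - p
    return weight["A"] / (weight["A"] + weight["B"])


# Hypothetical rates and skill states, for illustration only.
TP, FP = 0.9, 0.2

# Two items whose required skills this student lacks under Method A
# but has under Method B; both are answered correctly.
items = [(False, True, TP, FP), (False, True, TP, FP)]
post_a = posterior_method_a(0.5, items, [True, True])
print(round(post_a, 3))
```

Correct answers to items that are within reach only under Method B pull the posterior for Method A well below its prior, which is the pattern of belief shift described above.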

[[Figure 11 about here: network for both methods]]

Extensions 

This example could be extended in many ways, both as to the nature of the
observations and the nature of the student model. With the present student model, one
might explore additional sources of evidence about strategy use: monitoring response
times, tracing solution steps, or simply asking the students to describe their solutions!
Each has tradeoffs in terms of cost and evidential value, and each could be sensible in some
applications but not others. An important extension of the student model would be to allow
for strategy switching (Kyllonen, Lohman, & Snow, 1984). Adults, for example, often
decide whether to use Method A or Method B for a given item only after gauging which
would be easier to apply. The variables in this more complex student model would express
the tendencies of a student to employ various strategies under various conditions; students
would then be mixtures in and of themselves, with “always use Method A” and “always
use Method B” as extreme cases. Mixture problems are notoriously hard statistical
problems; carrying out inference in the context of this more ambitious student model would
certainly require the richer information mentioned above. Anne Béland and I (Béland &
Mislevy, 1992) tackled this problem in the domain of proportional reasoning, addressing
students’ solutions to balance-beam tasks. We modeled students in terms of neo-Piagetian
developmental stages based on the availability of certain concepts that could be fashioned
into strategies for different kinds of tasks. The data for inferring students’ stages were
their solutions and their explanations of the strategies they employed.

Conclusion 

Inference network models can play useful roles in educational assessment. One is
the use mentioned in our example, namely, cognitive diagnosis for short-term instructional
guidance as in an intelligent tutoring system (ITS). At ETS, we are currently working to
implement probability-based inference for updating the student model in an aircraft hydraulics
ITS (Gitomer, Steinberg, & Mislevy, in press). Another is mapping out the evidential
structure of observations and student knowledge structures (Haertel, 1989; Haertel &
Wiley, 1993). As both models and observational contexts become more complex, more
careful thought is required to sort out and characterize the implications and qualities of
assessment tasks if we are to use the information effectively. We plan to explore the kinds
of problems in which the approach outlined above proves efficacious, and to develop
exemplars and methodological tools for employing it.




Notes 


1  This  terminology  is  from  the  use  of  DAGs  in  pedigree  analysis,  where  nodes  represent 
characteristics  of  animals  that  are  in  fact  parents  and  children. 


2 Partial information, such as “based on a reading from an unreliable thermometer, I’d
place the probability of fever at .80,” would lead to proportional re-adjustment of the
columns, maintaining the proportional relationships within columns.

Table 1

Conditional Probabilities of Symptoms Given Disease States


FLU    THRINF    P(SORETHR)=yes    P(SORETHR)=no
yes    yes       .91               .09
yes    no        .05               .95
no     yes       .90               .10
no     no        .01               .99

FLU    THRINF    P(FEV)=yes        P(FEV)=no
yes    yes       .99               .01
yes    no        .90               .10
no     yes       .90               .10
no     no        .01               .99



Table 2


Potential Tables for Initial Status of Knowledge


Table 3

Potential Tables after “FEVER=yes”


Clique 1

FLU    THRINF    FEVER: yes    FEVER: no
yes    yes       .012          0
yes    no        .088          0
no     yes       .088          0
no     no        .008          0

FLU    THRINF    Probability
yes    yes       .012
yes    no        .088
no     yes       .088
no     no        .008

Clique 2

FLU    THRINF    SORTHR: yes    SORTHR: no
yes    yes
yes    no
no     yes       .080
no     no        .000

FLU    THRINF    SORTHR: yes    SORTHR: no
yes    yes       .059           .005
yes    no        .020           .426
no     yes       .406           .046
no     no        .000           .041


Table  4 


Skill Requirements for Fractions Items


If  Method  A  used 

If  Method  B  used 

Item# 

Text 

1 

2  5  6  7 

2  3  4  5 

4 

X 

X 

X 

X 

X 

6 

4-4= 

X 

7 

3-2^  = 

X 

X 

X 

X 

X 

X  X 

8 

4-f= 

X 

9 

n-2= 

X 

X 

X 

X 

X 

X 

10 

4A-2t4  = 

X 

X 

X 

X 

X 

X 

11 

4J-2f  = 

X 

X 

X 

X 

X 

X 

12 

11  _  1  - 
T  J- 

X 

X 

X 

14 

3j-3i  = 

X 

X 

X 

15 

2-4= 

X 

X 

X 

X 

X  X 

16 

1 

It 

X 

X 

X 

X 

17 

7|-f  = 

X 

X 

X 

X 

X 

18 

X 

X 

X 

X 

X 

X 

X 

19 

7-lf  = 

X 

X 

X 

X 

X 

X 

X 

X  X 

20 

44-14  = 

X 

X 

X 

X 

X 

X 

X 

Skills:
1  Basic fraction subtraction
2  Simplify/Reduce
3  Separate whole number from fraction
4  Borrow one from whole number to fraction
5  Convert whole number to fraction
6  Convert mixed number to fraction
7  Column borrow in subtraction


Table 5


Examples of Conditional Probability Matrices for Method B Network


Skill2 given Skill1

                  Skill 2 Probabilities
Skill 1 Status    Yes     No
Yes               .662    .338
No                .289    .711

Skills1&2 given Skill1, Skill2

                                    Skills 1&2 Probabilities
Skill 1 Status    Skill 2 Status    Yes    No
Yes               Yes               1      0
Yes               No                0      1
No                Yes               0      1
No                No                0      1

Item 12 given Skills 1&2

                     Item 12 Probabilities
Skills 1&2 Status    Correct    Incorrect
Yes                  .895       .105
No                   .452       .548


Table 6


Prior and Posterior Probabilities of Subprocedure Profile


Skill(s)           Prior Probability    Posterior Probability
1                  .883                 .999
2                  .618                 .056
3                  .937                 .995
4                  .406                 .702
5                  .355                 .561
1 & 2              .585                 .056
1 & 3              .853                 .994
1, 3, & 4          .392                 .702
2, 3, & 4          .335                 .007
1, 3, 4, & 5       .223                 .492
1, 2, 3, 4, & 5    .200                 .003


Figure  1 

Directed  Acyclic  Graph  Representation 


Figure  2 

Undirected,  Triangulated  Graph  Representation 


Figure 3

Clique Structure


Figure 6

Join Tree for Network with Empirical Connections Among Skills


Note: Bars represent probabilities, summing to one for all the possible values of a variable. A
shaded bar extending to one represents certainty, due to having observed the value of that variable.


Figure 8

Posterior Probabilities for Method B Following Item Responses


Figure 9

A Reduced Inference Network for Method B


Figure  10 

Join  Tree  for  Network  without  Empirical  Connections  Among  Skills 


Figure 11

Prior Probabilities in Inference Network for Both Methods Combined